JaeHyeonKim19

[Computer Architecture] 5. Large and Fast: Exploiting Memory Hierarchy

2020-10-04


This post is based on Professor Choi Kyu-sang's Computer Architecture lectures at Yeungnam University.

5.1 Introduction

  • Principle of Locality

    • Programs access a small proportion of their address space at any time
    • Temporal locality

      • Items accessed recently are likely to be accessed again soon
      • e.g., instructions in a loop, induction variables
    • Spatial locality

      • Items near those accessed recently are likely to be accessed soon
      • e.g., sequential instruction access, array data
  • Taking Advantage of Locality

    • Memory hierarchy
    • Store everything on disk
    • Copy recently accessed (and nearby) items from disk to smaller DRAM memory

      • Main memory
    • Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory

      • Cache memory attached to CPU
  • Memory Hierarchy Levels

    • Block (aka line): unit of copying

      • May be multiple words
    • If accessed data is present in upper level

      • Hit: access satisfied by upper level

        • Hit ratio: hits/accesses
    • If accessed data is absent

      • Miss: block copied from lower level

        • Time taken: miss penalty
        • Miss ratio: misses/accesses = 1 - hit ratio
      • Then accessed data supplied from upper level
  • Characteristics of the Memory Hierarchy
  • DRAM Technology

    • Data stored as a charge in a capacitor

      • Single transistor used to access the charge
      • Must periodically be refreshed

        • Read contents and write back
        • Performed on a DRAM "row"
  • Advanced DRAM Organization

    • Bits in a DRAM are organized as a rectangular array

      • DRAM accesses an entire row
      • Burst mode: supply successive words from a row with reduced latency
    • Double data rate (DDR) DRAM

      • Transfer on rising and falling clock edges
    • Quad data rate (QDR) DRAM

      • Separate DDR inputs and outputs
  • DRAM Generations

    • Across generations, price per GB has fallen while capacity has grown
  • DRAM Performance Factors

    • Row buffer

      • Allows several words to be read and refreshed in parallel
    • Synchronous DRAM

      • Allows for consecutive accesses in bursts without needing to send each address
      • Improves bandwidth
    • DRAM banking

      • Allows simultaneous access to multiple DRAMs
      • Improves bandwidth
  • Main Memory Supporting Caches

    • Use DRAMs for main memory

      • Fixed width (e.g., 1 word)
      • Connected by fixed-width clocked bus

        • Bus clock is typically slower than CPU clock
    • Example cache block read

      • 1 bus cycle for address transfer
      • 15 bus cycles per DRAM access
      • 1 bus cycle per data transfer
    • For 4-word block, 1-word-wide DRAM

      • Miss penalty = 1 + 4 * 15 + 4 * 1 = 65 bus cycles
      • Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle (worked in the sketch below)
  • Increasing Memory Bandwidth
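
To make the arithmetic above concrete, here is the same miss-penalty and bandwidth calculation in C (the cycle counts are the example's assumptions):

```c
#include <stdio.h>

/* 4-word block, 1-word-wide DRAM: 1 cycle for the address,
   15 cycles per DRAM access, 1 cycle per word transferred. */
int main(void) {
    int words_per_block = 4;
    int addr_cycles = 1, dram_cycles = 15, xfer_cycles = 1;

    int miss_penalty = addr_cycles
                     + words_per_block * dram_cycles
                     + words_per_block * xfer_cycles;          /* 65 bus cycles */
    double bandwidth = (words_per_block * 4.0) / miss_penalty; /* bytes/cycle */

    printf("Miss penalty = %d bus cycles\n", miss_penalty);
    printf("Bandwidth    = %.2f B/cycle\n", bandwidth);
    return 0;
}
```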

5.2 Memory Technologies

  • Review: Major Components of a Computer
  • Memory Technology

    • Static RAM (SRAM)

      • 0.5ns ~ 2.5ns, $2000 ~ $5000 per GB
    • Dynamic RAM (DRAM)

      • 50ns ~ 70ns, $20 ~ $75 per GB
    • Magnetic disk (HD)

      • 5ms ~ 20ms, $0.20 ~ $2 per GB
    • Ideal memory

      • Access time of SRAM
      • Capacity and cost/GB of disk
  • Processor-Memory Performance Gap

    • The performance gap between processors and memory has widened steadily over time
  • The "Memory Wall"

    • Processor vs DRAM speed disparity continues to grow
    • Good memory hierarchy (cache) design is increasingly important to overall performance

5.3 The Basics of Caches

  • Cache Memory

    • Cache memory

      • The level of the memory hierarchy closest to the CPU
    • Given accesses X1, ..., Xn-1, Xn
    • How do we know if the data is present?
    • Where do we look?
  • Caching: A Simple First Example
  • Direct Mapped Cache

    • Location determined by address
    • Direct mapped: only one choice

      • (Block address) modulo (#Blocks in cache)
    • #Blocks is a power of 2
    • Use low-order address bits
  • Tags and Valid Bits

    • How do we know which particular block is stored in a cache location?

      • Store block address as well as the data
      • Actually, only need the high-order bits
      • Called the tag
    • What if there is no data in a location?

      • Valid bit: 1 = present, 0 = not present
      • Initially 0
  • MIPS Direct Mapped Cache Example

    • One word blocks, cache size = 1K words (or 4KB)
  • Example: Larger Block Size

    • 64 blocks, 16 bytes/block

      • To what block number does address 1200 map?
    • Block address = 1200 / 16 = 75
    • Block number = 75 modulo 64 = 11
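
The same mapping can be computed directly in C; the masking variant works because both sizes are powers of 2:

```c
#include <stdio.h>

/* Direct-mapped placement for the example above:
   block address = byte address / block size,
   block number  = block address mod #blocks. */
int main(void) {
    unsigned addr = 1200;
    unsigned block_size = 16;   /* bytes/block */
    unsigned num_blocks = 64;

    unsigned block_addr = addr / block_size;        /* 75 */
    unsigned block_num  = block_addr % num_blocks;  /* 11 */
    /* Equivalent low-order-bit extraction: */
    unsigned block_num2 = (addr / block_size) & (num_blocks - 1);

    printf("block address = %u, block number = %u (%u)\n",
           block_addr, block_num, block_num2);
    return 0;
}
```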
  • Multiword Block Direct Mapped Cache

    • Four words/block, cache size = 1K words
  • Block Size Considerations

    • Larger blocks should reduce miss rate

      • Due to spatial locality
    • But in a fixed-sized cache

      • Larger blocks -> fewer of them

        • More competition -> increased miss rate
      • Larger blocks -> pollution
    • Larger miss penalty

      • Can override benefit of reduced miss rate
      • Early restart and critical-word-first can help
  • Cache Misses

    • On cache hit, CPU proceeds normally
    • On cache miss

      • Stall the CPU pipeline
      • Fetch block from next level of hierarchy
      • Instruction cache miss

        • Restart instruction fetch
      • Data cache miss

        • Complete data access
  • Write-Through

    • On data-write hit, could just update the block in cache

      • But then cache and memory would be inconsistent
    • Write through: also update memory
    • But makes writes take longer

      • e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles

        • Effective CPI = 1 + 0.1 * 100 = 11
    • Solution: write buffer

      • Holds data waiting to be written to memory
      • CPU continues immediately

        • Only stalls on write if write buffer is already full
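
The effective-CPI figure above, checked in C:

```c
#include <stdio.h>

/* Naive write-through: every store (10% of instructions)
   waits 100 cycles for memory. */
int main(void) {
    double base_cpi = 1.0;
    double store_frac = 0.10;
    double write_cycles = 100.0;

    printf("Effective CPI = %.0f\n", base_cpi + store_frac * write_cycles); /* 11 */
    return 0;
}
```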
  • Write-Back

    • Alternative: On data-write hit, just update the block in cache

      • Keep track of whether each block is dirty
    • When a dirty block is replaced

      • Write it back to memory
      • Can use a write buffer to allow replacing block to be read first
  • Write Allocation

    • What should happen on a write miss?
    • Alternatives for write-through

      • Allocate on miss: fetch the block
      • Write around: don't fetch the block

        • Since programs often write a whole block before reading it (e.g., initialization)
    • For write-back

      • Usually fetch the block
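
A minimal sketch of write-back with write allocate for a single direct-mapped line; the structure and the stubbed memory interface are illustrative, not from the lecture:

```c
#include <stdbool.h>
#include <stdio.h>

/* One direct-mapped cache line with a dirty bit. The memory
   interface is stubbed out with printfs for illustration. */
typedef struct {
    bool     valid;
    bool     dirty;      /* block modified since it was fetched */
    unsigned tag;
    unsigned data[4];    /* 4-word block */
} Line;

static void mem_write_block(unsigned tag) { printf("  write back tag %u\n", tag); }
static void mem_read_block(unsigned tag)  { printf("  fetch tag %u\n", tag); }

static void write_word(Line *ln, unsigned tag, unsigned word, unsigned val) {
    if (!(ln->valid && ln->tag == tag)) {   /* write miss */
        if (ln->valid && ln->dirty)
            mem_write_block(ln->tag);       /* replaced block is dirty */
        mem_read_block(tag);                /* write allocate: fetch block */
        ln->valid = true;
        ln->tag = tag;
    }
    ln->data[word] = val;   /* update cache only; memory write deferred */
    ln->dirty = true;       /* written back when the block is replaced */
}

int main(void) {
    Line ln = {0};
    write_word(&ln, 7, 0, 42);   /* miss: fetch tag 7 */
    write_word(&ln, 7, 1, 43);   /* hit: just mark dirty */
    write_word(&ln, 9, 0, 44);   /* miss: write back 7, then fetch 9 */
    return 0;
}
```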
  • Example: Intrinsity FastMATH

    • Embedded MIPS processor

      • 12-stage pipeline
      • Instruction and data access on each cycle
    • Split cache: separate I-cache and D-cache

      • Each 16KB: 256 blocks * 16 words/block
      • D-cache: write-through or write-back
    • SPEC2000 miss rates

      • I-cache: 0.4%
      • D-cache: 11.4%
      • Weighted average: 3.2%

5.4 Measuring and Improving Cache Performance

  • Measuring Cache Performance

    • Components of CPU time

      • Program execution cycles

        • Includes cache hit time
      • Memory stall cycles

        • Mainly from cache misses
    • With simplifying assumptions:

      Memory stall cycles
      = (Memory accesses / Program) * Miss rate * Miss penalty
      = (Instructions / Program) * (Misses / Instruction) * Miss penalty
  • Cache Performance Example

    • Given

      • I-cache miss rate = 2%
      • D-cache miss rate = 4%
      • Miss penalty = 100 cycles
      • Base CPI (ideal cache) = 2
      • Loads & stores are 36% of instructions
    • Miss cycles per instruction

      • I-cache: 0.02 * 100 = 2
      • D-cache: 0.36 * 0.04 * 100 = 1.44
    • Actual CPI = 2 + 2 + 1.44 = 5.44

      • Ideal CPU is 5.44/2 = 2.72 times faster
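
The CPI arithmetic from this example, as a small C program:

```c
#include <stdio.h>

/* Miss cycles per instruction with the example's figures. */
int main(void) {
    double base_cpi = 2.0, penalty = 100.0;
    double icache_miss = 0.02, dcache_miss = 0.04, mem_frac = 0.36;

    double i_stalls = icache_miss * penalty;             /* 2.00 */
    double d_stalls = mem_frac * dcache_miss * penalty;  /* 1.44 */
    double cpi = base_cpi + i_stalls + d_stalls;         /* 5.44 */

    printf("Actual CPI = %.2f, ideal CPU is %.2fx faster\n",
           cpi, cpi / base_cpi);
    return 0;
}
```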
  • Average Access Time

    • Hit time is also important for performance
    • Average memory access time (AMAT)

      • AMAT = Hit time + Miss rate * Miss penalty
    • Example

      • CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5%
      • AMAT = 1 + 0.05 * 20 = 2ns

        • 2 cycles per instruction
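
AMAT as a one-line helper, using the example's numbers:

```c
#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty. */
static double amat(double hit, double miss_rate, double penalty) {
    return hit + miss_rate * penalty;
}

int main(void) {
    double cycles = amat(1.0, 0.05, 20.0);
    printf("AMAT = %.0f cycles = %.0f ns at a 1 ns clock\n", cycles, cycles);
    return 0;
}
```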
  • Performance Summary

    • As CPU performance increases

      • Miss penalty becomes more significant
    • Decreasing base CPI

      • Greater proportion of time spent on memory stalls
    • Increasing clock rate

      • Memory stalls account for more CPU cycles
    • Can't neglect cache behavior when evaluating system performance
  • Associative Caches

    • Fully associative

      • Allow a given block to go in any cache entry
      • Requires all entries to be searched at once
      • Comparator per entry (expensive)
    • n-way set associative

      • Each set contains n entries
      • Block number determines which set

        • (Block number) modulo (#Sets in cache)
      • Search all entries in a given set at once
      • n comparators (less expensive)
  • Associative Cache Example
  • Spectrum of Associativity

    • For a cache with 8 entries
  • Associativity Example

    • Compare 4-block caches

      • Direct mapped, 2-way set associative, fully associative
      • Block access sequence: 0, 8, 0, 6, 8
    • Direct mapped
    • 2-way set associative
    • Fully associative
  • How Much Associativity

    • Increased associativity decreases miss rate

      • But with diminishing returns
    • Simulation of a system with a 64KB D-cache, 16-word blocks, SPEC2000

      • 1-way: 10.3%
      • 2-way: 8.6%
      • 4-way: 8.3%
      • 8-way: 8.1%
  • Set Associative Cache Organization
  • Replacement Policy

    • Direct mapped: no choice
    • Set associative

      • Prefer non-valid entry, if there is one
      • Otherwise, choose among entries in the set
    • Least-recently used (LRU)

      • Choose the one unused for the longest time

        • Simple for 2-way, manageable for 4-way, too hard beyond that
    • Random

      • Gives approximately the same performance as LRU for high associativity
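
For 2-way set associativity, LRU takes just one bit per set. The sketch below (names illustrative) replays the block sequence 0, 8, 0, 6, 8 from the associativity example; in a 4-block 2-way cache all three blocks map to the same set:

```c
#include <stdbool.h>
#include <stdio.h>

/* One 2-way set with a single LRU bit, flipped on every access. */
typedef struct {
    bool     valid[2];
    unsigned tag[2];
    int      lru;      /* index of the least-recently-used way */
} Set;

static int access_set(Set *s, unsigned tag) {
    int way = -1;
    for (int w = 0; w < 2; w++)
        if (s->valid[w] && s->tag[w] == tag)
            way = w;                                    /* hit */
    if (way < 0) {                                      /* miss */
        way = !s->valid[0] ? 0 : !s->valid[1] ? 1 : s->lru;
        s->valid[way] = true;                           /* fill or evict LRU */
        s->tag[way] = tag;
    }
    s->lru = 1 - way;   /* the other way becomes LRU */
    return way;
}

int main(void) {
    Set s = {0};
    unsigned seq[] = {0, 8, 0, 6, 8};
    for (int i = 0; i < 5; i++)
        printf("block %u -> way %d\n", seq[i], access_set(&s, seq[i]));
    return 0;
}
```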
  • Multilevel Caches

    • Primary cache attached to CPU

      • Small, but fast
    • Level-2 cache services misses from primary cache

      • Larger, slower, but still faster than main memory
    • Main memory services L-2 cache misses
    • Some high-end systems include L-3 cache
  • Multilevel Cache Example

    • Given

      • CPU base CPI = 1, clock rate = 4GHz
      • Miss rate/instruction = 2%
      • Main memory access time = 100ns
    • With just primary cache

      • Miss penalty = 100ns/0.25ns = 400 cycles
      • Effective CPI = 1 + 0.02 * 400 = 9
    • Now add L-2 cache

      • Access time = 5ns
      • Global miss rate to main memory = 0.5%
    • Primary miss with L-2 hit

      • Penalty = 5ns/0.25ns = 20 cycles
    • Primary miss with L-2 miss

      • Extra penalty = 400 cycles
    • CPI = 1 + 0.02 * 20 + 0.005 * 400 = 3.4
    • Performance ratio = 9/3.4 = 2.6
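
Both CPIs and the speedup, checked in C:

```c
#include <stdio.h>

/* One-level vs two-level CPI with the example's parameters. */
int main(void) {
    double cycle_ns = 0.25;                 /* 4GHz clock */
    double l1_miss = 0.02, global_miss = 0.005;

    double mem_penalty = 100.0 / cycle_ns;  /* 400 cycles */
    double l2_penalty  = 5.0 / cycle_ns;    /*  20 cycles */

    double cpi_l1_only = 1.0 + l1_miss * mem_penalty;      /* 9.0 */
    double cpi_with_l2 = 1.0 + l1_miss * l2_penalty
                             + global_miss * mem_penalty;  /* 3.4 */

    printf("CPI %.1f -> %.1f, speedup = %.1fx\n",
           cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2);
    return 0;
}
```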
  • Multilevel Cache Considerations

    • Primary cache

      • Focus on minimal hit time
    • L-2 cache

      • Focus on low miss rate to avoid main memory access
      • Hit time has less overall impact
    • Results

      • L-1 cache usually smaller than a single cache
      • L-1 block size smaller than L-2 block size
  • Interactions with Advanced CPUs

    • Out-of-order CPUs can execute instructions during cache miss

      • Pending store stays in load/store unit
      • Dependent instructions wait in reservation stations

        • Independent instructions continue
    • Effect of miss depends on program data flow

      • Much harder to analyze
      • Use system simulation
  • Interactions with Software

    • Misses depend on memory access patterns

      • Algorithm behavior
      • Compiler optimization for memory access

5.6 Virtual Machines

  • pass

5.7 Virtual Memory

  • Virtual Memory

    • Use main memory as a "cache" for secondary (disk) storage

      • Managed jointly by CPU hardware and the operating system (OS)
    • Programs share main memory

      • Each gets a private virtual address space holding its frequently used code and data
      • Protected from other programs
    • CPU and OS translate virtual addresses to physical addresses

      • VM "block" is called a page
      • VM translation "miss" is called a page fault
  • Address Translation

    • Fixed-size pages (e.g., 4K)
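
A small sketch of the split for 4K pages; the virtual address and physical page number below are made up for illustration:

```c
#include <stdio.h>

/* 32-bit VA, 4KB pages: low 12 bits are the page offset,
   the rest is the virtual page number (VPN). */
int main(void) {
    unsigned va = 0x12345678;   /* illustrative address */
    unsigned page_bits = 12;

    unsigned vpn    = va >> page_bits;              /* indexes the page table */
    unsigned offset = va & ((1u << page_bits) - 1); /* unchanged by translation */

    unsigned ppn = 0x00042;                         /* made-up PPN from the PTE */
    unsigned pa  = (ppn << page_bits) | offset;

    printf("VPN = 0x%05x, offset = 0x%03x, PA = 0x%08x\n", vpn, offset, pa);
    return 0;
}
```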
  • Virtual Addressing with a Cache

    • It takes an extra memory access to translate a VA to a PA
    • This makes memory (cache) accesses very expensive (if every access were really two accesses)
    • The hardware fix is to use a Translation Lookaside Buffer (TLB): a small cache that keeps track of recently used address mappings to avoid having to do a page table lookup on every access
  • Page Fault Penalty

    • On page fault, the page must be fetched from disk

      • Takes millions of clock cycles
      • Handled by OS code
    • Try to minimize page fault rate

      • Fully associative placement
      • Smart replacement algorithms
  • Page Tables

    • Stores placement information

      • Array of page table entries, indexed by virtual page number
      • Page table register in CPU points to page table in physical memory
    • If page is present in memory

      • PTE stores the physical page number
      • Plus other status bits (referenced, dirty, ...)
    • If page is not present

      • PTE can refer to location in swap space on disk
  • Two Programs Sharing Physical Memory

    • A program's address space is divided into pages (all one fixed size) or segments (variable sizes)

      • The starting location of each page (either in main memory or in secondary memory) is contained in the program's page table
  • Translation Using a Page Table
  • Mapping Pages to Storage
  • Replacement and Writes

    • To reduce page fault rate, prefer least-recently used (LRU) replacement

      • Reference bit (aka use bit) in PTE set to 1 on access to page
      • Periodically cleared to 0 by OS
      • A page with reference bit = 0 has not been used recently
    • Disk writes take millions of cycles

      • Block at once, not individual locations
      • Write through is impractical
      • Use write-back
      • Dirty bit in PTE set when page is written
  • Fast Translation Using a TLB

    • Address translation would appear to require extra memory references

      • One to access the PTE
      • Then the actual memory access
    • But access to page tables has good locality

      • So use a fast cache of PTEs within the CPU
      • Called a Translation Look-aside Buffer (TLB)
      • Typical: 16-512 PTEs, 0.5-1 cycle for hit, 10-100 cycles for miss, 0.01%-1% miss rate
      • Misses could be handled by hardware or software
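
A minimal sketch of a fully associative TLB lookup; the entry count and field names are illustrative, not the lecture's:

```c
#include <stdbool.h>
#include <stdio.h>

#define TLB_ENTRIES 16

typedef struct {
    bool     valid;
    unsigned vpn;   /* virtual page number (the tag) */
    unsigned ppn;   /* cached translation from the PTE */
} TlbEntry;

static TlbEntry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit; on a miss the PTE must be loaded
   from the page table in memory (by hardware or software). */
static bool tlb_lookup(unsigned vpn, unsigned *ppn) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return true;
        }
    return false;   /* TLB miss: walk the page table, then refill */
}

int main(void) {
    tlb[0] = (TlbEntry){ true, 0x12345, 0x00042 };
    unsigned ppn;
    printf("hit=%d\n", tlb_lookup(0x12345, &ppn));  /* 1 */
    printf("hit=%d\n", tlb_lookup(0x99999, &ppn));  /* 0 */
    return 0;
}
```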
  • A TLB in the Memory Hierarchy

    • A TLB miss - is it a page fault or merely a TLB miss?

      • If the page is loaded into main memory, then the TLB miss can be handled (in hardware or software) by loading the translation information from the page table into the TLB

        • Takes 10's of cycles to find and load the translation info into the TLB
      • If the page is not in main memory, then it's a true page fault

        • Takes 1,000,000's of cycles to service a page fault
    • TLB misses are much more frequent than true page faults
  • TLB Misses

    • If page is in memory

      • Load the PTE from memory and retry
      • Could be handled in hardware

        • Can get complex for more complicated page table structures
      • Or in software

        • Raise a special exception, with optimized handler
    • If page is not in memory (page fault)

      • OS handles fetching the page and updating the page table
      • Then restart the faulting instruction
  • TLB Miss Handler

    • TLB miss indicates

      • Page present, but PTE not in TLB
      • Page not present
    • Must recognize TLB miss before destination register overwritten

      • Raise exception
    • Handler copies PTE from memory to TLB

      • Then restarts instruction
      • If page not present, page fault will occur
  • Page Fault Handler

    • Use faulting virtual address to find PTE
    • Locate page on disk
    • Choose page to replace

      • If dirty, write to disk first
    • Read page into memory and update page table
    • Make process runnable again

      • Restart from faulting instruction
  • TLB and Cache Interaction

    • If cache tag uses physical address

      • Need to translate before cache lookup
    • Alternative: use virtual address tag

      • Complications due to aliasing

        • Different virtual addresses for shared physical address
  • Memory Protection

    • Different tasks can share parts of their virtual address spaces

      • But need to protect against errant access
      • Requires OS assistance
    • Hardware support for OS protection

      • Privileged supervisor mode (aka kernel mode)
      • Privileged instructions
      • Page tables and other state information only accessible in supervisor mode
      • System call exception (e.g., syscall in MIPS)

5.8 A Common Framework for Memory Hierarchies

  • The Memory Hierarchy

    • Common principles apply at all levels of the memory hierarchy

      • Based on notions of caching
    • At each level in the hierarchy

      • Block placement
      • Finding a block
      • Replacement on a miss
      • Write policy
  • Block Placement

    • Determined by associativity

      • Direct mapped (1-way associative)

        • One choice for placement
      • n-way set associative

        • n choices within a set
      • Fully associative

        • Any location
    • Higher associativity reduces miss rate

      • Increases complexity, cost, and access time
  • Finding a Block

    • Hardware caches

      • Reduce comparisons to reduce cost
    • Virtual memory

      • Full table lookup makes full associativity feasible
      • Benefit in reduced miss rate
  • Replacement

    • Choice of entry to replace on a miss

      • Least recently used (LRU)

        • Complex and costly hardware for high associativity
      • Random

        • Close to LRU, easier to implement
    • Virtual memory

      • LRU approximation with hardware support
  • Write Policy

    • Write-through

      • Update both upper and lower levels
      • Simplifies replacement, but may require write buffer
    • Write-back

      • Update upper level only
      • Update lower level when block is replaced
      • Need to keep more state
    • Virtual memory

      • Only write-back is feasible, given disk write latency
  • Sources of Misses

    • Compulsory misses (aka cold start misses)

      • First access to a block
    • Capacity misses

      • Due to finite cache size
      • A replaced block is later accessed again
    • Conflict misses (aka collision misses)

      • In a non-fully associative cache
      • Due to competition for entries in a set
      • Would not occur in a fully associative cache of the same total size
  • Cache Design Trade-offs

5.9 Using a Finite State Machine to Control a Simple Cache

  • Cache Control

    • Example cache characteristics

      • Direct-mapped, write-back, write allocate
      • Block size: 4 words (16 bytes)
      • Cache size: 16 KB (1024 blocks)
      • 32-bit byte addresses
      • Valid bit and dirty bit per block
      • Blocking cache

        • CPU waits until access is complete
  • Interface Signals
  • Finite State Machines

    • Use an FSM to sequence control steps
    • Set of states, transition on each clock edge

      • State values are binary encoded
      • Current state stored in a register
      • Next state = fn(current state, current inputs)
    • Control output signals = fo(current state)
  • Cache Controller FSM
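
A sketch of the controller's next-state function in C, following the four-state FSM this section describes (Idle, Compare Tag, Write-Back, Allocate); the input signal names are illustrative:

```c
#include <stdio.h>

typedef enum { IDLE, COMPARE_TAG, WRITE_BACK, ALLOCATE } State;

typedef struct {
    int cpu_request;   /* valid CPU read/write this cycle */
    int hit;           /* tag match and valid bit set */
    int dirty;         /* victim block needs write-back */
    int mem_ready;     /* memory finished current transfer */
} Inputs;

/* Next state = fn(current state, current inputs). */
static State next_state(State s, Inputs in) {
    switch (s) {
    case IDLE:        return in.cpu_request ? COMPARE_TAG : IDLE;
    case COMPARE_TAG:
        if (in.hit)   return IDLE;               /* hit: access done */
        return in.dirty ? WRITE_BACK : ALLOCATE; /* miss */
    case WRITE_BACK:  return in.mem_ready ? ALLOCATE : WRITE_BACK;
    case ALLOCATE:    return in.mem_ready ? COMPARE_TAG : ALLOCATE;
    }
    return IDLE;
}

int main(void) {
    /* Trace a miss to a dirty block: Idle -> Compare Tag ->
       Write-Back -> Allocate -> Compare Tag. */
    Inputs in = { .cpu_request = 1, .hit = 0, .dirty = 1, .mem_ready = 1 };
    State s = IDLE;
    for (int cycle = 0; cycle < 5; cycle++) {
        printf("cycle %d: state %d\n", cycle, s);
        s = next_state(s, in);
    }
    return 0;
}
```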

5.10 Parallelism and Memory Hierarchies: Cache Coherence

  • Cache Coherence Problem

    • Suppose two CPU cores share a physical address space

      • Write-through caches
  • Coherence Defined

    • Informally: Reads return most recently written value
    • Formally:

      • P writes X; P reads X (no intervening writes) -> read returns written value
      • P1 writes X; P2 reads X (sufficiently later) -> read returns written value

        • c.f. CPU B reading X after step 3 in example
      • P1 writes X, P2 writes X -> all processors see writes in the same order

        • End up with the same final value for X
  • Cache Coherence Protocols

    • Operations performed by caches in multiprocessors to ensure coherence

      • Migration of data to local caches

        • Reduces bandwidth for shared memory
      • Replication of read-shared data

        • Reduces contention for access
    • Snooping protocols

      • Each cache monitors bus reads/writes
    • Directory-based protocols

      • Caches and memory record sharing status of blocks in a directory
  • Invalidating Snooping Protocols

    • Cache gets exclusive access to a block when it is to be written

      • Broadcasts an invalidate message on the bus
      • Subsequent read in another cache misses

        • Owning cache supplies updated value
  • Memory Consistency

    • When are writes seen by other processors

      • "Seen" means a read returns the written value
      • Can't be instantaneous
    • Assumptions

      • A write completes only when all processors have seen it
      • A processor does not reorder writes with other accesses
    • Consequence

      • P writes X then writes Y -> all processors that see new Y also see new X
      • Processors can reorder reads, but not writes

5.13 The ARM Cortex-A8 and Intel Core i7 Memory Hierarchies

  • pass

5.14 Going Faster: Cache Blocking and Matrix Multiply

  • pass

5.15 Fallacies and Pitfalls

  • Pitfalls

    • Byte vs. word addressing

      • Example: 32-byte direct-mapped cache, 4-byte blocks

        • Byte 36 maps to block 1
        • Word 36 maps to block 4 (checked in the sketch at the end of this section)
    • Ignoring memory system effects when writing or generating code

      • Example: iterating over rows vs. columns of arrays
      • Large strides result in poor locality
    • In multiprocessor with shared L2 or L3 cache

      • Less associativity than cores results in conflict misses
      • More cores -> need to increase associativity
    • Using AMAT to evaluate performance of out-of-order processors

      • Ignores effect of non-blocked accesses
      • Instead, evaluate performance by simulation
    • Extending address range using segments

      • E.g., Intel 80286
      • But a segment is not always big enough
      • Makes address arithmetic complicated
    • Implementing a VMM on an ISA not designed for virtualization

      • E.g., non-privileged instructions accessing hardware resources
      • Either extend ISA, or require guest OS not to use problematic instructions
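
The byte-vs-word addressing pitfall above, verified numerically:

```c
#include <stdio.h>

/* A 32-byte direct-mapped cache with 4-byte blocks has 8 blocks;
   "36" maps differently as a byte address vs a word address. */
int main(void) {
    unsigned num_blocks = 32 / 4;   /* 8 blocks */

    unsigned byte_addr = 36;
    unsigned byte_block = (byte_addr / 4) % num_blocks;  /* 9 mod 8 = 1 */

    unsigned word_addr = 36;
    unsigned word_block = word_addr % num_blocks;        /* 36 mod 8 = 4 */

    printf("byte 36 -> block %u, word 36 -> block %u\n",
           byte_block, word_block);
    return 0;
}
```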

5.16 Concluding Remarks

  • Concluding Remarks

    • Fast memories are small, large memories are slow

      • We really want fast, large memory
      • Caching gives this illusion
    • Principle of locality

      • Programs use a small part of their memory space frequently
    • Memory hierarchy

      • L1 cache <-> L2 cache <-> ... <-> DRAM memory <-> disk
    • Memory system design is critical for multiprocessors